library("here")
library("edgeR")
library("tibble")
library("dplyr")
library("ggplot2")
library("DESeq2")
library("ade4")
library("kableExtra")
library("SummarizedExperiment")
library("pasilla")
library("pheatmap")
library("factoextra")
library("MASS")
library("pheatmap")
library("lattice")
library("ggcorrplot")
library("ggfortify")
library("ggrepel")
library("pracma")
library("ggridges")

Chapter Overview

High-throughput sequencing assays that provide quantitative readouts in the form of count data include;

-RNA-Seq:RNA molecules found in a population of cells or in a tissue

-ChIP-Seq: DNA regions that are bound to particular DNA-binding proteins (selected by immuno-precipitation)

-RIP-Seq:RNA molecules or regions of them bound to a particular RNA-binding protein

-DNA-Seq: genomic DNA with prevalence of genetic variants in heterogeneous populations of cells

-HiC: map the 3D spatial arrangement of DNA

-genetic screens: proliferation or survival of cells upon gene knockdown

-microbiome: abundance of different microbial species in complex microbial habitats.

Analyzing such data requires elaborate statistical techniques and considerations. This chapter addresses on such techniques with a special focus on RNA-seq.

**Basic RNA-seq Workflow**

Basic RNA-seq Workflow

8.1 Chapter Goals

The main aim is to detect and quantify systematic changes between samples from different conditions.

Statstical concepts and tools that will be employed include; - multifactorial designs, linear models and analysis of variance

8.2 Some core concepts

8.3 Count data

We bigin with gene count data from an experiment on effect of RNAi knockdown of Pasilla (Brooks et al. (2011)), the Drosophila melanogaster ortholog of mammalian NOVA1 and NOVA2, on the transcriptome.

fn = system.file("extdata", "pasilla_gene_counts.tsv",
                  package = "pasilla", mustWork = TRUE)
counts = as.matrix(read.csv(fn, sep = "\t", row.names = "gene_id"))
dim(counts)
## [1] 14599     7
counts[ 2000+(0:3), ]
##             untreated1 untreated2 untreated3 untreated4 treated1 treated2
## FBgn0020369       3387       4295       1315       1853     4884     2133
## FBgn0020370       3186       4305       1824       2094     3525     1973
## FBgn0020371          1          0          1          1        1        0
## FBgn0020372         38         84         29         28       63       28
##             treated3
## FBgn0020369     2165
## FBgn0020370     2120
## FBgn0020371        0
## FBgn0020372       27

8.3.1 The challenges of count data

-Heteroskedasticity

-Non-symmetric distribution

-Systematic sampling biases

-Sources of stochastic experimental variation

Designed RNA-seq experiments are usually limited in in replicates used and hence lack enough power. Hence distributional assumptions that let us compute the probabilities of rare events in the tails of the distribution are made.

8.3.2 RNA-Seq: what about gene structures, splicing, isoforms?

These are approched differently than a DGE analysis.

Isoforms: Gene isoforms are mRNAs that are produced from the same locus but are different in their transcription start sites (TSSs), protein coding DNA sequences (CDSs) and/or untranslated regions (UTRs), potentially altering gene function. The analysis of DAST is challenging because the isoform origin for only a small fraction of the sequenced reads can be determined in a typical RNA-seq dataset Hu et al. (2018).

Analyses of such gene structures are termed, Differential alternative splicing or transcription (DAST). This is done by tools such as:

  • PennDiff

  • Cuffdiff

  • SplicingCompass

  • rSeqDiff

  • DiffSplice

8.4 Modeling count data

8.4.1 Dispersion

Gene expression variability across technical replicates, has been shown to approximately follow a Poisson distribution, for which the variance is equal to the mean.

Thus the probability that a given read maps to the \(i^{th}\) gene is \(p_i = n_i/n\), and that this is pretty much independent of the outcomes for all the other reads. This can be modelled as a poison distribution with rate; \(λ_i=rp_i\) where \(r\) is the number of reads.

However, biological replication introduces additional cross-sample variability, and analysis frameworks therefore have resorted to the usage of the gamma-Poisson or the negative binomial (NB) distribution which has an additional dispersion parameter and a quadratic mean-variance relationship.

8.4.2 Normalization

The observed counts of the features cannot be directly compared across samples since there are differences in sequencing depth across libraries; hence the need for scaling/normalizing the counts.

► Question

For the example dataset count of Section 8.3, how does the output of DESeq2’s estimateSizeFactorsForMatrix compare to what you get by simply taking the column sums?

ggplot(tibble(
  `size factor` = estimateSizeFactorsForMatrix(counts),
  `sum` = colSums(counts)), aes(x = `size factor`, y = `sum`)) +
  geom_point()

► Task

Locate the R sources for this book and have a look at the code that produces Figure 8.1.

szfcDemo = data.frame(
  x = c(2, 4, 6, 6,  8) * 10,
  y = c(3, 6, 2, 9, 12) * 10,
  name = LETTERS[1:5],
  check.names = FALSE)
slopes =  c(
  blue = with(szfcDemo, sum(y) / sum(x)),
  red = szfcDemo[, c("x", "y")] %>% as.matrix %>%
    (DESeq2::estimateSizeFactorsForMatrix) %>% (function(x) x[2]/x[1]) %>% as.vector)
ggplot(szfcDemo, aes(x = x, y = y, label = name)) + geom_point() +
  coord_fixed() + xlim(c(0, 128)) + ylim(c(0, 128)) + xlab("sample 1") + ylab("sample 2") +
  geom_text(hjust= 0.5, vjust = -0.6) +
  geom_abline(slope = slopes[1], col = names(slopes)[1]) +
  geom_abline(slope = slopes[2], col = names(slopes)[2])

► Question

Plot the mean-variance relationship for the biological replicates in the pasilla dataset.

library("ggplot2")
library("matrixStats")
sf = estimateSizeFactorsForMatrix(counts)
ncounts  = counts / matrix(sf,
   byrow = TRUE, ncol = ncol(counts), nrow = nrow(counts))
uncounts = ncounts[, grep("^untreated", colnames(ncounts)),
                     drop = FALSE]
ggplot(tibble(
        mean = rowMeans(uncounts),
        var  = rowVars( uncounts)),
     aes(x = log(mean), y = log(var))) +
  geom_hex() + coord_fixed() + theme(legend.position = "none") +
  geom_abline(slope = 1:2, color = c("forestgreen", "red"))
## Warning: Removed 2713 rows containing non-finite values (stat_binhex).

The green line is expected when mean equals the mean. The red line (slope 2) corresponds to the quadratic mean-variance relationship \(v=m^2\)

8.5 A basic analysis

Load the pasilla data.

annotationFile = system.file("extdata",
  "pasilla_sample_annotation.csv",
  package = "pasilla", mustWork = TRUE)
pasillaSampleAnno = readr::read_csv(annotationFile)
## Parsed with column specification:
## cols(
##   file = col_character(),
##   condition = col_character(),
##   type = col_character(),
##   `number of lanes` = col_double(),
##   `total number of reads` = col_character(),
##   `exon counts` = col_double()
## )
pasillaSampleAnno
## # A tibble: 7 x 6
##   file     condition type     `number of lane… `total number of r… `exon counts`
##   <chr>    <chr>     <chr>               <dbl> <chr>                       <dbl>
## 1 treated… treated   single-…                5 35158667                 15679615
## 2 treated… treated   paired-…                2 12242535 (x2)            15620018
## 3 treated… treated   paired-…                2 12443664 (x2)            12733865
## 4 untreat… untreated single-…                2 17812866                 14924838
## 5 untreat… untreated single-…                6 34284521                 20764558
## 6 untreat… untreated paired-…                2 10542625 (x2)            10283129
## 7 untreat… untreated paired-…                2 12214974 (x2)            11653031

Replace type column with underscores and and convert the type and condition columns into factors, explicitly specifying the prefered order of the levels.

pasillaSampleAnno = mutate(pasillaSampleAnno,
condition = factor(condition, levels = c("untreated", "treated")),
type = factor(sub("-.*", "", type), levels = c("single", "paired")))

with(pasillaSampleAnno,
       table(condition, type))
##            type
## condition   single paired
##   untreated      2      2
##   treated        1      2

Deseq2 uese DESeqDataSet to store the datasets and related metadata.

Create a DESeqDataSet from the count data matrix counts and the sample annotation dataframe pasillaSampleAnno.

mt = match(colnames(counts), sub("fb$", "", pasillaSampleAnno$file))
stopifnot(!any(is.na(mt)))


pasilla = DESeqDataSetFromMatrix(
  countData = counts,
  colData   = pasillaSampleAnno[mt, ],
  design    = ~ condition)
class(pasilla)
## [1] "DESeqDataSet"
## attr(,"package")
## [1] "DESeq2"

► Question

How can we access the row metadata of a SummarizedExperiment object, i.e., how can we read it out, how can we change it?

colData(pasilla)
## DataFrame with 7 rows and 6 columns
##                    file condition     type number of lanes
##             <character>  <factor> <factor>       <numeric>
## untreated1 untreated1fb untreated   single               2
## untreated2 untreated2fb untreated   single               6
## untreated3 untreated3fb untreated   paired               2
## untreated4 untreated4fb untreated   paired               2
## treated1     treated1fb   treated   single               5
## treated2     treated2fb   treated   paired               2
## treated3     treated3fb   treated   paired               2
##            total number of reads exon counts
##                      <character>   <numeric>
## untreated1              17812866    14924838
## untreated2              34284521    20764558
## untreated3         10542625 (x2)    10283129
## untreated4         12214974 (x2)    11653031
## treated1                35158667    15679615
## treated2           12242535 (x2)    15620018
## treated3           12443664 (x2)    12733865

8.5.2 The DESeq2 method

Runing the deseq function does three things;

-estimateSizeFactors

-estimateDispersions

-nbinomWaldTest

pasilla = DESeq(pasilla)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
#View the results
res = results(pasilla)
nrow(res)
## [1] 14599
res[order(res$padj), ] %>% head
## log2 fold change (MLE): condition treated vs untreated 
## Wald test p-value: condition treated vs untreated 
## DataFrame with 6 rows and 6 columns
##                     baseMean    log2FoldChange              lfcSE
##                    <numeric>         <numeric>          <numeric>
## FBgn0039155 730.595806139728 -4.61901333325316  0.168706777691263
## FBgn0025111 1501.41051323996  2.89986434281464  0.126920536499104
## FBgn0029167 3706.11653071978 -2.19700018176299 0.0969888426793355
## FBgn0003360 4343.03539692487 -3.17967219633152  0.143526399548808
## FBgn0035085 638.232608936723 -2.56041189472443  0.137295200724247
## FBgn0039827 261.916235943103 -4.16251579404328  0.232588799612089
##                          stat                pvalue                  padj
##                     <numeric>             <numeric>             <numeric>
## FBgn0039155 -27.3789434927507 4.88598917049146e-165 4.06660878660004e-161
## FBgn0025111  22.8478733450288 1.53429589608622e-115 6.38497237156281e-112
## FBgn0029167 -22.6520919424383 1.33042414468249e-113 3.69104005206412e-110
## FBgn0003360 -22.1539187656569 9.56283102522569e-109 1.98978606557384e-105
## FBgn0035085 -18.6489540873823  1.28771779124823e-77  2.14353503531181e-74
## FBgn0039827 -17.8964584751523  1.25663334671151e-71  1.74315989077998e-68

8.5.3 Exploring the results

FOur exploraoty plots can b generated; can be e

#Histogram of p-values
ggplot(as(res, "data.frame"), aes(x = pvalue)) +
  geom_histogram(binwidth = 0.01, fill = "Royalblue", boundary = 0)
## Warning: Removed 2241 rows containing non-finite values (stat_bin).

The dispersion can be observed as

plotDispEsts(pasilla)

► Question

If the histogram for your data is indicative of batch effects, what can you do?

Ue sva or RUVSeq to correct for the batch effects especially if they are not known.

plotMA(pasilla, ylim = c( -2, 2))

#PCA plot of the log transformed data
pas_rlog = rlogTransformation(pasilla)
plotPCA(pas_rlog, intgroup=c("condition", "type")) + coord_fixed()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

#Heatmap
library("pheatmap")
select = order(rowMeans(assay(pas_rlog)), decreasing = TRUE)[1:30]
pheatmap( assay(pas_rlog)[select, ],
     scale = "row",
     annotation_col = as.data.frame(
        colData(pas_rlog)[, c("condition", "type")] ))

By default, pheatmap arranges the rows and columns of the matrix by the dendrogram from (unsupervised) clustering.

#Explort results
write.csv(as.data.frame(res), file = "treated_vs_untreated.csv")

8.6 Critique of default choices and possible modifications

8.6.1 Few changes assumption

Underlying the default normalization and the dispersion estimation in DESeq2 (and many other differential expression methods) is that most genes are not differentially expressed.

What to do if assumption is not correct?

- Solution: Don't apply operations on all genes

- identify subset of -ve control genes which we belive the assumption holds

- because prior knowledge or controlled abundance as external "spike in" (PhiX) as used in MACQ2 experiment.
    

NB: For the normalization, although not for the dispersion estimation, one can slightly relax this assumption: it is still valid if many genes are changing, but in a way that is balanced between up- and downward directions.

8.6.1 The few changes assumption

► Task

Run the DESeq2 workflow with size factors and dispersion parameters estimated only from a predefined subset of genes.

pasilla1 = DESeqDataSetFromMatrix(
  countData = counts,
  colData   = pasillaSampleAnno[mt, ],
  design    = ~ condition)
class(pasilla1)
## [1] "DESeqDataSet"
## attr(,"package")
## [1] "DESeq2"
p= estimateSizeFactors(pasilla1, controlGenes= (1:130))
p1=estimateDispersions(p)
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
p1=nbinomWaldTest(p1)
res1=results(p1)
nrow(res1)
## [1] 14599
res[order(res1$padj), ] %>% head
## log2 fold change (MLE): condition treated vs untreated 
## Wald test p-value: condition treated vs untreated 
## DataFrame with 6 rows and 6 columns
##                     baseMean    log2FoldChange              lfcSE
##                    <numeric>         <numeric>          <numeric>
## FBgn0039155 730.595806139728 -4.61901333325316  0.168706777691263
## FBgn0029167 3706.11653071978 -2.19700018176299 0.0969888426793355
## FBgn0025111 1501.41051323996  2.89986434281464  0.126920536499104
## FBgn0003360 4343.03539692487 -3.17967219633152  0.143526399548808
## FBgn0035085 638.232608936723 -2.56041189472443  0.137295200724247
## FBgn0039827 261.916235943103 -4.16251579404328  0.232588799612089
##                          stat                pvalue                  padj
##                     <numeric>             <numeric>             <numeric>
## FBgn0039155 -27.3789434927507 4.88598917049146e-165 4.06660878660004e-161
## FBgn0029167 -22.6520919424383 1.33042414468249e-113 3.69104005206412e-110
## FBgn0025111  22.8478733450288 1.53429589608622e-115 6.38497237156281e-112
## FBgn0003360 -22.1539187656569 9.56283102522569e-109 1.98978606557384e-105
## FBgn0035085 -18.6489540873823  1.28771779124823e-77  2.14353503531181e-74
## FBgn0039827 -17.8964584751523  1.25663334671151e-71  1.74315989077998e-68

There are differences in the p values

8.7 Multi-factor designs and linear models

These refer to experiments where more than one condition are assessed. this will require an extension of the linear model to accommodate all factors.

8.7.1 What is a multifactorial design?

This refers to experiments with more than one factor influencing the counts.

8.7.4 Robustness

► Question

Plot the graph of the function proposed by Huber (1964) for M-estimators.

rho = function(x, s)
  ifelse(abs(x) < s, x^2 / 2,  s * abs(x) - s^2 / 2)

df = tibble(
  x        = seq(-7, 7, length.out = 100),
  parabola = x ^ 2 / 2,
  Huber    = rho(x, s = 2))

ggplot(reshape2::melt(df, id.vars = "x"),
  aes(x = x, y = value, col = variable)) + geom_line()

8.9 Two-factor analysis of the pasilla data

Including the type and conditions into the design, a two factor formula, a two factor analysis with DESeq2 can be executed.

pasillaTwoFactor = pasilla
design(pasillaTwoFactor) = formula(~ type + condition)
pasillaTwoFactor = DESeq(pasillaTwoFactor)
## using pre-existing size factors
## estimating dispersions
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
#View the results
res2 = results(pasillaTwoFactor)
head(res2, n = 3)
## log2 fold change (MLE): condition treated vs untreated 
## Wald test p-value: condition treated vs untreated 
## DataFrame with 3 rows and 6 columns
##                      baseMean      log2FoldChange             lfcSE
##                     <numeric>           <numeric>         <numeric>
## FBgn0000003 0.171568715207063   0.674551784677737  3.87109094992796
## FBgn0000008  95.1440789963134 -0.0406731386938223 0.222214994845543
## FBgn0000014  1.05657219346166 -0.0849879584016901  2.11182144008079
##                            stat            pvalue              padj
##                       <numeric>         <numeric>         <numeric>
## FBgn0000003   0.174253664768659  0.86166611221445                NA
## FBgn0000008  -0.183035077007712 0.854770496036173 0.951974876389847
## FBgn0000014 -0.0402439130452425 0.967898668421209                NA

Comparing the P-values from both the two factor and one factor analysis;

trsf = function(x) ifelse(is.na(x), 0, (-log10(x)) ^ (1/6))
ggplot(tibble(pOne = res$pvalue,
              pTwo = res2$pvalue),
    aes(x = trsf(pOne), y = trsf(pTwo))) +
    geom_hex(bins = 75) + coord_fixed() +
    xlab("Single factor analysis (condition)") +
    ylab("Two factor analysis (type + condition)") +
    geom_abline(col = "orange")

The p-values in the two-factor analysis are similar to those from the one-factor analysis, but are generally smaller.

Compare the p-values

compareRes = table(
   `simple analysis` = res$padj < 0.1,
   `two factor` = res2$padj < 0.1 )
addmargins( compareRes )
##                two factor
## simple analysis FALSE TRUE  Sum
##           FALSE  6973  289 7262
##           TRUE     25 1036 1061
##           Sum    6998 1325 8323

The two-factor analysis found 1325 genes differentially expressed at an FDR threshold of 10%, while the one-factor analysis found 1061.

Without modeling the blocking factor, the variability in the data that is due to it has to be absorbed by the \(ε\)s. This means that they are generally larger than in the model with the blocking factor.

Not taking into account a blocking factor can also lead to the detection of more genes

8.10 Further statistical concepts

8.10.1 Sharing of dispersion information across genes

  • DESeq2 uses an empirical Bayes approach for the estimation of the dispersion parameters with the priors taken from the distributions of the maximum-likelihood estimates (MLEs) across all genes

8.10.2 Count data transformations

Transformations are useful for downstream analysis and data exploration such as clustering.

Common transformation approaches are;

vsp = varianceStabilizingTransformation(pasilla)

-Regularized logarithm (rlog) transformation

Comparing vst and log2 transformation

j = 1
ggplot(tibble(
         x    = assay(pasilla)[, j],
         VST  = assay(vsp)[, j],
         log2 = log2(assay(pasilla)[, j])) %>%
             reshape2::melt(id.vars = "x"),
       aes(x = x, y = value, col = variable)) +
  geom_line() + xlim(c(0, 600)) + ylim(c(0, 9)) +
  xlab("counts") + ylab("transformed")
## Warning: Removed 8234 rows containing missing values (geom_path).

VST

It is conceptually distinct from variance stabilization in that it builds upon the shrinkage estimation.

It works by transforming the original count data to a \(log_2\)-like scale by fitting a “trivial” model with a separate term for each sample and a prior distribution on the coefficients which is estimated from the data.

► Question

Plot mean against standard deviation between replicates for the shifted logarithm ((8.17)), the regularized log transformation and the variance-stabilizing transformation.

library("vsn")
rlp = rlogTransformation(pasilla)

msd = function(x)
  meanSdPlot(x, plot = FALSE)$gg + ylim(c(0, 1)) +
     theme(legend.position = "none")

gridExtra::grid.arrange(
  msd(log2(counts(pasilla, normalized = TRUE) + 1)) +
    ylab("sd(log2)"),
  msd(assay(vsp)) + ylab("sd(vst)"),
  msd(assay(rlp)) + ylab("sd(rlog)"),
  ncol = 3
)
## Warning: Removed 463 rows containing non-finite values (stat_binhex).
## Warning: Removed 14 rows containing missing values (geom_hex).
## Warning: Removed 21 rows containing non-finite values (stat_binhex).
## Warning: Removed 9 rows containing missing values (geom_hex).
## Warning: Removed 2 rows containing non-finite values (stat_binhex).
## Warning: Removed 20 rows containing missing values (geom_hex).

8.10.3 Dealing with outliers

Outliers are extremely large counts They arise through;

-rare technical or experimental artifacts

-read mapping problems in the case of genetically differing samples,

-genuine, but rare biological events.

In Deseq2;

-The results function automatically flags genes which contain a Cook’s distance above a cutoff for samples which have 3 or more replicates. The p values and adjusted p values for these genes are set to NA.

par(mar=c(8,5,2,2))
boxplot(log10(assays(pasillaTwoFactor)[["cooks"]]), range=0, las=2)

8.10.4 Tests of \(log_2\) fold change above or below a threshold

To detect effects that have a strong enough size, as opposed to ones that are statistically significant, the results function allow for threshold-based Wald tests: lfcThreshold, which takes a numeric of a non-negative threshold value, and altHypothesis which specifies the kind of test.

par(mfrow = c(4, 1), mar = c(2, 2, 1, 1))
myMA = function(h, v, theta = 0.5) {
  plotMA(pasilla, lfcThreshold = theta, altHypothesis = h,
         ylim = c(-2.5, 2.5))
  abline(h = v * theta, col = "dodgerblue", lwd = 2)
}
myMA("greaterAbs", c(-1, 1))
myMA("lessAbs",    c(-1, 1))
myMA("greater",          1)
myMA("less",         -1   )

Exercise

pasillaEdge1 <- DGEList(counts=counts, sample = pasillaSampleAnno[mt,], group=pasillaSampleAnno$condition)
design <- model.matrix(~ condition, pasillaEdge1$samples)
class(pasillaEdge1)
## [1] "DGEList"
## attr(,"package")
## [1] "edgeR"
y <- calcNormFactors(pasillaEdge1, method="TMM")
plotMDS(pasillaEdge1)

y1 <- estimateDisp(y, design, robust=TRUE)
y1$common.dispersion
## [1] 0.0228685
plotBCV(y1)

#Generalized Linear Models with Quasi-likelihood Tests

glmfit <- glmQLFit(y1, design, robust=TRUE)
plotQLDisp(glmfit)

results <- glmQLFTest(glmfit, coef = 2)
topTags(results)
## Coefficient:  conditiontreated 
##                 logFC   logCPM        F       PValue          FDR
## FBgn0039155 -4.601317 5.882317 958.0935 3.023628e-17 4.414194e-13
## FBgn0025111  2.905756 6.923428 703.5673 4.791924e-16 3.497865e-12
## FBgn0003360 -3.173036 8.452776 462.5921 1.973292e-14 9.602698e-11
## FBgn0035085 -2.548257 5.684922 437.2499 3.239556e-14 1.182357e-10
## FBgn0039827 -4.142255 4.397963 419.5859 4.653398e-14 1.358699e-10
## FBgn0034736 -3.492036 4.186934 385.9148 9.683601e-14 2.072631e-10
## FBgn0029167 -2.188103 8.221274 384.7723 9.937952e-14 2.072631e-10
## FBgn0029896 -2.434452 5.305827 333.8934 3.421243e-13 6.243341e-10
## FBgn0000071  2.685868 4.795202 300.7758 8.457310e-13 1.371870e-09
## FBgn0034434 -3.624878 3.214994 264.6460 2.545131e-12 3.715636e-09
is.de <- decideTests(results, p.value=0.05)
summary(is.de)
##        conditiontreated
## Down                327
## NotSig            13919
## Up                  353
plotMD(results)
abline(h=c(-1, 1), col="blue")

##the glmLRT approach##
#Conduct likelihood ratio tests for 1 vs 2 and show the top genes:
#First fit genewise glms:
fit1 <- glmFit(y1, design)
lrt <- glmLRT(fit1, coef = 2)
topTags(lrt)
## Coefficient:  conditiontreated 
##                 logFC   logCPM       LR        PValue           FDR
## FBgn0039155 -4.604477 5.882317 686.2611 2.906395e-151 4.243046e-147
## FBgn0025111  2.906356 6.923428 498.6996 1.823501e-110 1.331065e-106
## FBgn0003360 -3.173027 8.452776 352.3142  1.327984e-78  6.462413e-75
## FBgn0039827 -4.143209 4.397963 327.8901  2.769069e-73  1.010641e-69
## FBgn0029167 -2.187794 8.221274 307.5157  7.592933e-69  2.216985e-65
## FBgn0035085 -2.550032 5.684922 303.0060  7.292541e-68  1.774397e-64
## FBgn0034736 -3.494263 4.186934 262.8782  4.047158e-59  8.440638e-56
## FBgn0029896 -2.432680 5.305827 230.5551  4.511023e-52  8.232053e-49
## FBgn0000071  2.686094 4.795202 221.8658  3.543406e-50  5.747798e-47
## FBgn0034434 -3.619774 3.214994 188.8292  5.726708e-43  8.360422e-40
plotMD(lrt)
abline(h=c(-1, 1), col="blue")

#The total number of differentially expressed genes at 5% FDR is given by:
summary(decideTests(lrt))
##        conditiontreated
## Down                362
## NotSig            13830
## Up                  407
#Compare the Pvalues

trsf = function(x) ifelse(is.na(x), 0, (-log10(x)) ^ (1/6))
ggplot(tibble(pOne = res$pvalue,
              QLF = results$table$PValue),
    aes(x = trsf(pOne), y = trsf(QLF))) +
    geom_hex(bins = 75) + coord_fixed() +
    xlab("DEseq2") +
    ylab("EdgeRQLF") +
    geom_abline(col = "orange")

ggplot(tibble(pOne = res$pvalue,
              Lrt = lrt$table$PValue),
    aes(x = trsf(pOne), y = trsf(Lrt))) +
    geom_hex(bins = 75) + coord_fixed() +
    xlab("DEseq2") +
    ylab("glmLRT") +
    geom_abline(col = "orange")

ggplot(tibble(QLF = results$table$PValue,
              Lrt = lrt$table$PValue),
    aes(x = trsf(Lrt), y = trsf(QLF))) +
    geom_hex(bins = 75) + coord_fixed() +
    xlab("glmLRT") +
    ylab("QLF") +
    geom_abline(col = "orange")

References

Brooks, Angela N, Li Yang, Michael O Duff, Kasper D Hansen, Jung W Park, Sandrine Dudoit, Steven E Brenner, and Brenton R Graveley. 2011. “Conservation of an Rna Regulatory Map Between Drosophila and Mammals.” Genome Research 21 (2). Cold Spring Harbor Lab: 193–202.

Hu, Yu, Jennie Lin, Jian Hu, Gang Hu, Kui Wang, Hanrui Zhang, Muredach P Reilly, and Mingyao Li. 2018. “PennDiff: Detecting Differential Alternative Splicing and Transcription by Rna Sequencing.” Bioinformatics 34 (14). Oxford University Press: 2384–91.